You should have R installed. If not, open a web browser, go to http://cran.r-project.org, and download and install it.
It is also helpful to install RStudio (download from http://rstudio.com).
In R, type install.packages("ggplot2") to install the ggplot2 package.
cpi <- read.csv("Datasets/CPI-data.csv")
head(cpi[1:5])
## X2016.Rank Country X2016.Score X2015.Score X2014.Score
## 1 1 Denmark 90 91 92
## 2 1 New Zealand 90 88 91
## 3 3 Finland 89 90 89
## 4 4 Sweden 88 89 87
## 5 5 Switzerland 86 86 86
## 6 6 Norway 85 87 86
housing <- read.csv("Datasets/landdata-states.csv")
head(housing[1:5])
Compared to base graphics, ggplot2
* is more verbose for simple / canned graphics
* is less verbose for complex / custom graphics
* does not have methods (data should always be in a data.frame)
* uses a different system for adding plot elements
Base graphics histogram example:
hist(housing$Home.Value)
ggplot2 histogram example:
library(ggplot2)
ggplot(cpi, aes(x = X2016.Score)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
ggplot(housing, aes(x = Home.Value)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Clearly, the base graphics histogram looks better and cleaner here. Base graphics wins!
Base graphics colored scatter plot example:
plot(Home.Value ~ Date,
data=subset(housing, State == "MA"))
points(Home.Value ~ Date, col="red",
data=subset(housing, State == "TX"))
legend(1975, 400000,
c("MA", "TX"), title="State",
col=c("black", "red"),
pch=c(1, 1))
ggplot2 color scatter plot example:
ggplot(subset(housing, State %in% c("MA", "TX")),
aes(x=Date, y=Home.Value, color=State)) +
geom_point()
Clearly, the ggplot2 colored scatter plot looks better and cleaner. ggplot2 wins!
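In ggplot land aesthetic means “something you can see”. Examples include position (on the x and y axes), color, fill, shape, linetype, and size.
Geometric objects are the actual marks we put on a plot. Examples include points (geom_point), lines (geom_line), and boxplots (geom_boxplot).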
You can get a list of available geometric objects using the code below:
help.search("geom_", package = "ggplot2")
or simply type geom_ followed by TAB in RStudio to see a list of functions starting with geom_.
Now that we know about geometric objects and aesthetic mapping, we can make a ggplot. geom_point requires mappings for x and y, all others are optional.
hp2001Q1 <- subset(housing, Date == 2001.25)
ggplot(hp2001Q1, aes(y = Structure.Cost, x = Land.Value)) +
geom_point()
For a better view of the underlying structure of the data, take the log of Land.Value.
ggplot(hp2001Q1, aes(y = Structure.Cost, x = log(Land.Value))) +
geom_point()
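The log transformation above is applied inside aes(), so the axis shows logged values. One possible alternative, sketched here with a made-up data frame (`toy` and its columns are illustrative and are not the housing data), is to map the raw variable and transform the axis with scale_x_log10(). Note that scale_x_log10() uses base 10 while log() is the natural log, so the two axes differ by a constant factor but show the same pattern.

```r
library(ggplot2)

# Hypothetical toy data standing in for Land.Value / Structure.Cost
toy <- data.frame(land = c(1000, 10000, 100000, 1000000),
                  cost = c(50, 80, 110, 140))

# Option 1: transform the variable inside aes(), as in the text
p_aes <- ggplot(toy, aes(x = log(land), y = cost)) +
  geom_point()

# Option 2: map the raw variable and transform the axis instead;
# the tick labels then show the original units
p_scale <- ggplot(toy, aes(x = land, y = cost)) +
  geom_point() +
  scale_x_log10()
```

With the second form the axis labels stay in the original units, which is often easier to read.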
A plot constructed with ggplot can have more than one geom. In that case the mappings established in the ggplot() call are plot defaults that can be added to or overridden. Our plot could use a regression line:
hp2001Q1$pred.SC <- predict(lm(Structure.Cost ~ log(Land.Value), data = hp2001Q1))
p1 <- ggplot(hp2001Q1, aes(x = log(Land.Value), y = Structure.Cost))
p1 + geom_point(aes(color = Home.Value)) +
geom_line(aes(y = pred.SC))
Not all geometric objects are simple shapes–the smooth geom includes a line and a ribbon. Note: geom_smooth() by default uses method = ‘loess’
p1 +
geom_point(aes(color = Home.Value)) +
geom_smooth()
## `geom_smooth()` using method = 'loess'
Each geom accepts a particular set of mappings–for example geom_text() accepts a label mapping.
p1 +
geom_text(aes(label=State), size = 3)
But what if you want to see both points and text labels?
## install.packages("ggrepel")
library("ggrepel")
p1 +
geom_point() +
geom_text_repel(aes(label=State), size = 3)
Note that variables are mapped to aesthetics with the aes() function, while fixed aesthetics are set outside the aes() call. This sometimes leads to confusion, as in this example:
p1 +
geom_point(aes(size = 2), # incorrect! 2 is not a variable
color="red") # this is fine -- all points red
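To make the distinction between setting and mapping concrete, here is a minimal sketch on a made-up data frame (`toy`, `x`, and `y` are illustrative names, not part of the housing data). Constants are set outside aes(); data columns are mapped inside aes().

```r
library(ggplot2)

# Hypothetical toy data; the column names are illustrative only
toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 8))

# Setting: constants go outside aes(), so every point gets size 2
# and color red, and no legend is drawn for them
p_set <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(size = 2, color = "red")

# Mapping: a data column goes inside aes(), so size varies with y
# and ggplot2 adds a size legend
p_map <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(size = y), color = "red")
```

Calling layer_data(p_set) shows the aesthetic values each point actually receives, which is a quick way to check whether a value was set or mapped.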
Other aesthetics are mapped in the same way as x and y in the previous example.
p1 +
geom_point(aes(color=Home.Value, shape = region))
## Warning: Removed 1 rows containing missing values (geom_point).
The data for the exercises is available in the dataSets/EconomistData.csv file. Read it in with
dat <- read.csv("dataSets/EconomistData.csv")
head(dat)
## X Country HDI.Rank HDI CPI Region
## 1 1 Afghanistan 172 0.398 1.5 Asia Pacific
## 2 2 Albania 70 0.739 3.1 East EU Cemt Asia
## 3 3 Algeria 96 0.698 2.9 MENA
## 4 4 Angola 148 0.486 2.0 SSA
## 5 5 Argentina 45 0.797 3.0 Americas
## 6 6 Armenia 86 0.716 2.6 East EU Cemt Asia
ggplot(dat, aes(x = CPI, y = HDI, size = HDI.Rank)) + geom_point()
dat <- read.csv("dataSets/EconomistData.csv")
Original sources for these data are http://www.transparency.org/content/download/64476/1031428 and http://hdrstats.undp.org/en/indicators/display_cf_xls_indicator.cfm?indicator_id=103106&lang=en
These data consist of Human Development Index and Corruption Perception Index scores for several countries.
dat <- read.csv("dataSets/EconomistData.csv")
head(dat)
## X Country HDI.Rank HDI CPI Region
## 1 1 Afghanistan 172 0.398 1.5 Asia Pacific
## 2 2 Albania 70 0.739 3.1 East EU Cemt Asia
## 3 3 Algeria 96 0.698 2.9 MENA
## 4 4 Angola 148 0.486 2.0 SSA
## 5 5 Argentina 45 0.797 3.0 Americas
## 6 6 Armenia 86 0.716 2.6 East EU Cemt Asia
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point()
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point(color="blue")
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point(aes(color=Region))
ggplot(dat, aes(x = CPI, y = HDI)) +
  geom_point(aes(color = Region), size = 2)
ggplot(dat, aes(x = CPI, y = HDI, size = HDI.Rank)) +
geom_point(aes(color=Region))
ggplot(dat, aes(x = Region, y = CPI)) +
geom_boxplot()
ggplot(dat, aes(x = Region, y = CPI)) +
geom_boxplot() +
geom_point()
Some plot types (such as scatterplots) do not require transformations–each point is plotted at x and y coordinates equal to the original value. Other plots, such as boxplots, histograms, prediction lines etc. require statistical transformations:
Each geom has a default statistic, but these can be changed. For example, the default statistic for geom_histogram is stat_bin:
args(geom_histogram)
## function (mapping = NULL, data = NULL, stat = "bin", position = "stack",
## ..., binwidth = NULL, bins = NULL, na.rm = FALSE, show.legend = NA,
## inherit.aes = TRUE)
## NULL
args(stat_bin)
## function (mapping = NULL, data = NULL, geom = "bar", position = "stack",
## ..., binwidth = NULL, bins = NULL, center = NULL, boundary = NULL,
## breaks = NULL, closed = c("right", "left"), pad = FALSE,
## na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
## NULL
Arguments to stat_ functions can be passed through geom_ functions. This can be slightly annoying because in order to change it you have to first determine which stat the geom uses, then determine the arguments to that stat.
For example, here is the default histogram of Home.Value:
p2 <- ggplot(housing, aes(x = Home.Value))
p2 + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The default looks reasonable, but we can change it by passing the binwidth argument through geom_histogram to the stat_bin function:
p2 + geom_histogram(stat = "bin", binwidth=4000)
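The pass-through of stat arguments can be checked directly. The sketch below uses simulated values as a stand-in for Home.Value (`toy` and `value` are hypothetical names) and inspects the bins that stat_bin() computed.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for Home.Value
toy <- data.frame(value = runif(200, min = 50000, max = 300000))

p_toy <- ggplot(toy, aes(x = value))

# binwidth is an argument of stat_bin(); geom_histogram() forwards it
h_width <- p_toy + geom_histogram(binwidth = 4000)

# bins (the number of bins) is forwarded the same way
h_bins <- p_toy + geom_histogram(bins = 10)

# layer_data() returns the computed bins, one row per bar
head(layer_data(h_width)[, c("xmin", "xmax", "count")])
```

Each row of the result is one bar, and xmax - xmin equals the requested binwidth.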
Sometimes the default statistical transformation is not what you need. This is often the case with pre-summarized data:
housing.sum <- aggregate(housing["Home.Value"], housing["State"], FUN=mean)
rbind(head(housing.sum), tail(housing.sum))
## State Home.Value
## 1 AK 147385.14
## 2 AL 92545.22
## 3 AR 82076.84
## 4 AZ 140755.59
## 5 CA 282808.08
## 6 CO 158175.99
## 46 VA 155391.44
## 47 VT 132394.60
## 48 WA 178522.58
## 49 WI 108359.45
## 50 WV 77161.71
## 51 WY 122897.25
## ggplot(housing.sum, aes(x=State, y=Home.Value)) +
## geom_bar()
##Error: stat_count() must not be used with a y aesthetic.
What is the problem with the previous plot? Basically we take binned and summarized data and ask ggplot to bin and summarize it again (remember, geom_bar defaults to stat = stat_count); obviously this will not work. We can fix it by telling geom_bar to use a different statistical transformation function:
ggplot(housing.sum, aes(x=State, y=Home.Value)) +
geom_bar(stat="identity")
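As a small self-contained illustration of the same fix, the sketch below uses a made-up pre-summarized data frame (`toy_sum` is hypothetical). It also shows geom_col(), which in current ggplot2 releases is a shorthand for a bar geom with stat = "identity".

```r
library(ggplot2)

# Hypothetical pre-summarized data: one row per group
toy_sum <- data.frame(group = c("a", "b", "c"),
                      mean_value = c(10, 25, 15))

# As in the text: tell geom_bar() to plot the values as they are
p_identity <- ggplot(toy_sum, aes(x = group, y = mean_value)) +
  geom_bar(stat = "identity")

# Shorthand: geom_col() uses stat = "identity" by default
p_col <- ggplot(toy_sum, aes(x = group, y = mean_value)) +
  geom_col()
```

Both plots draw bars whose heights are the values in the data, with no counting or binning.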
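Overlay a smoothing line on the scatter plot using geom_smooth. Repeat with geom_smooth, but use a linear model for the predictions. Hint: see ?stat_smooth. Overlay a smoothing line using geom_line. Hint: change the statistical transformation. For the loess smoother, see ?loess.
dat <- read.csv("dataSets/EconomistData.csv")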
head(dat)
## X Country HDI.Rank HDI CPI Region
## 1 1 Afghanistan 172 0.398 1.5 Asia Pacific
## 2 2 Albania 70 0.739 3.1 East EU Cemt Asia
## 3 3 Algeria 96 0.698 2.9 MENA
## 4 4 Angola 148 0.486 2.0 SSA
## 5 5 Argentina 45 0.797 3.0 Americas
## 6 6 Armenia 86 0.716 2.6 East EU Cemt Asia
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point()
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess'
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point() +
geom_smooth(method = "lm")
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point() +
geom_smooth(method = "glm")
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point() +
geom_line(stat = "smooth", method = "loess")
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point() +
geom_smooth(method = "loess")
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point() +
geom_smooth(method = "loess", span = 0.3)
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point() +
geom_smooth(method = "loess", span = 0.5)
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point() +
geom_smooth(method = "loess", span = 0.8)
Aesthetic mapping (i.e., with aes()) only says that a variable should be mapped to an aesthetic. It doesn’t say how that should happen. For example, when mapping a variable to shape with aes(shape = x) you don’t say what shapes should be used. Similarly, aes(color = z) doesn’t say what colors should be used. Describing what colors/shapes/sizes etc. to use is done by modifying the corresponding scale. In ggplot2, scales include position, color and fill, size, shape, and line type.
Scales are modified with a series of functions using a scale_<aesthetic>_<type> naming scheme. Try typing scale_<tab> to see a list of scale modification functions.
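The following arguments are common to most scales in ggplot2: name (the axis or legend title), limits (the minimum and maximum of the scale), breaks (the points along the scale where labels should appear), and labels (the labels that appear at each break).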
Specific scale functions may have additional arguments; for example, the scale_color_continuous function has arguments low and high for setting the colors at the low and high end of the scale.
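A minimal sketch of the common scale arguments in use, on a made-up data frame (`toy` and the label text are illustrative assumptions):

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(x = c(1, 5, 9), y = c(2, 4, 8))

p_scaled <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  scale_x_continuous(name   = "Toy x axis",              # axis title
                     limits = c(0, 10),                  # min and max of the scale
                     breaks = c(0, 5, 10),               # where ticks appear
                     labels = c("low", "mid", "high"))   # text shown at each break
```

The labels vector must have the same length as breaks, since each label is drawn at the corresponding break.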
Geoms that draw points have a shape parameter. Legal shape values are the numbers 0 to 25 and the numbers 32 to 127.
For geom_point, the default shape is 19 (a solid circle). The shape can be set to a constant value or it can be mapped via a scale.
To set the shape to a constant value, use the shape geom parameter (e.g., geom_point(data=d, mapping=aes(x=x, y=y), shape=3) sets the shape of all points in the layer to 3, which corresponds to a “+”).
Note: The ggplot2 shape parameter corresponds to the pch parameter of the R base graphics package.
scale_shape_discrete maps up to 6 distinct values to pre-defined shapes. The scale has a boolean option, “solid”, which determines whether the pre-defined set of shapes contains some solid shapes. If solid = TRUE, the first three shapes are solid (but the fourth to sixth shapes are hollow).
Note that even though the first three shapes are solid, these three shapes are not actually filled with the fill color (but they are completely drawn in the outline color).
d=data.frame(a=c("a","b","c","d","e","f"))
ggplot() +
  scale_x_discrete(name = "") +
  scale_y_continuous(limits = c(0, 1), breaks = NULL, name = "") +
  scale_shape_discrete(solid = TRUE, guide = "none") +
  geom_point(data = d, mapping = aes(x = a, y = 0.5, shape = a), size = 10)
The scale_shape_identity scale can be used to pass through any legal shape value (its mapping is the identity function, and thus it does not change anything).
d=data.frame(p=c(0:25,32:127))
ggplot() + scale_y_continuous(name="") +
scale_x_continuous(name="") +
scale_shape_identity() +
geom_point(data=d, mapping=aes(x=p%%16, y=p%/%16, shape=p),
size=5, fill="red") +
geom_text(data=d, mapping=aes(x=p%%16, y=p%/%16+0.25, label=p), size=3)
Start by constructing a dotplot showing the distribution of home values by Date and State.
p3 <- ggplot(housing,
aes(x = State,
y = Home.Price.Index)) +
theme(legend.position="top",
axis.text=element_text(size = 6))
(p4 <- p3 + geom_point(aes(color = Date),
alpha = 0.5,
size = 1.5,
position = position_jitter(width = 0.25, height = 0)))
Now modify the breaks for the x axis and color scales
p4 + scale_x_discrete(name="State Abbreviation") +
scale_color_continuous(name="",
breaks = c(1976, 1994, 2013),
labels = c("'76", "'94", "'13"))
Next change the low and high values to blue and red:
p4 +
scale_x_discrete(name="State Abbreviation") +
scale_color_continuous(name="",
breaks = c(1976, 1994, 2013),
labels = c("'76", "'94", "'13"),
low = "blue", high = "red")
p4 +
scale_x_discrete(name="State Abbreviation") +
scale_color_continuous(name="",
breaks = c(1976, 1994, 2013),
labels = c("'76", "'94", "'13"),
low = scales::muted("blue"), high = scales::muted("red"))
ggplot2 has a wide variety of color scales; here is an example using scale_color_gradient2 to interpolate between three different colors.
p4 +
scale_color_gradient2(name="",
breaks = c(1976, 1994, 2013),
labels = c("'76", "'94", "'13"),
low = scales::muted("blue"),
high = scales::muted("red"),
mid = "gray60",
midpoint = 1994)
Available Scales
Partial combination matrix of available scales. Note that in RStudio you can type scale_ followed by TAB to get the whole list of available scales.
| Scale | Types | Examples |
|---|---|---|
| scale_color_ | identity | scale_fill_continuous |
| scale_fill_ | manual | scale_color_discrete |
| scale_size_ | continuous | scale_size_manual |
| | discrete | scale_size_discrete |
| scale_shape_ | discrete | scale_shape_discrete |
| scale_linetype_ | identity | scale_shape_manual |
| | manual | scale_linetype_discrete |
| scale_x_ | continuous | scale_x_continuous |
| scale_y_ | discrete | scale_y_discrete |
| | reverse | scale_x_log10 |
| | log | scale_y_reverse |
| | date | scale_x_date |
| | datetime | scale_y_datetime |
Hint: see ?scale_color_manual.
dat <- read.csv("dataSets/EconomistData.csv")
head(dat)
## X Country HDI.Rank HDI CPI Region
## 1 1 Afghanistan 172 0.398 1.5 Asia Pacific
## 2 2 Albania 70 0.739 3.1 East EU Cemt Asia
## 3 3 Algeria 96 0.698 2.9 MENA
## 4 4 Angola 148 0.486 2.0 SSA
## 5 5 Argentina 45 0.797 3.0 Americas
## 6 6 Armenia 86 0.716 2.6 East EU Cemt Asia
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point()
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point(aes(color=Region))
ggplot(dat, aes(x = CPI, y = HDI, size = HDI.Rank)) +
geom_point(aes(color=Region))
ggplot(dat, aes(x = CPI, y = HDI, size = HDI.Rank)) +
geom_point(aes(color=Region)) +
scale_x_continuous(name="Corruption Perceptions Index") +
scale_y_continuous(name="Human Development Index")
ggplot(dat, aes(x = CPI, y = HDI, size = HDI.Rank)) +
geom_point(aes(color=Region)) +
scale_x_continuous(name="Corruption Perceptions Index") +
scale_y_continuous(name="Human Development Index") +
scale_colour_manual(
name = "Regions of the World",
values = c("red", "blue", "green", "orange", "yellow", "black"))
Faceting is ggplot2 parlance for small multiples. ggplot2 offers two functions for creating small multiples:
* facet_wrap(): define subsets as the levels of a single grouping variable
* facet_grid(): define subsets as the crossing of two grouping variables
Start by using a technique we already know–map State to color:
p5 <- ggplot(housing, aes(x = Date, y = Home.Value))
p5 + geom_line(aes(color = State))
There are two problems here–there are too many states to distinguish each one by color, and the lines obscure one another.
We can remedy the deficiencies of the previous plot by faceting by state rather than mapping state to color.
(p5 <- p5 + geom_line() +
facet_wrap(~State, ncol = 10))
There is also a facet_grid() function for faceting in two dimensions.
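A minimal sketch of facet_grid() on a made-up data frame with two grouping variables (`toy`, `row_group`, and `col_group` are hypothetical names):

```r
library(ggplot2)

# Hypothetical toy data with two grouping variables
toy <- expand.grid(x = 1:3,
                   row_group = c("r1", "r2"),
                   col_group = c("c1", "c2"))
toy$y <- toy$x * 2

# Rows of the grid come from row_group, columns from col_group
p_grid <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  facet_grid(row_group ~ col_group)
```

The formula reads as rows ~ columns, so this produces a 2 x 2 grid with one panel per combination of the two variables.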
The ggplot2 theme system handles non-data plot elements such as axis labels, plot background, facet label background, and legend appearance.
Built-in themes include:
* theme_gray() (default)
* theme_bw()
* theme_classic()
p5 + theme_linedraw()
p5 + theme_light()
Specific theme elements can be overridden using theme(). For example:
p5 + theme_minimal() +
theme(text = element_text(color = "turquoise"))
All theme options are documented in ?theme.
You can create new themes, as in the following example:
theme_new <- theme_bw() +
theme(plot.background = element_rect(size = 1, color = "blue", fill = "black"),
text=element_text(color = "ivory"),
axis.text.y = element_text(colour = "purple"),
axis.text.x = element_text(colour = "red"),
panel.background = element_rect(fill = "pink"),
strip.background = element_rect(fill = scales::muted("orange")))
p5 + theme_new
The most frequently asked question goes something like this: I have two variables in my data.frame, and I’d like to plot them as separate points, with different color depending on which variable it is. How do I do that?
Wrong way
housing.byyear <- aggregate(cbind(Home.Value, Land.Value) ~ Date, data = housing, mean)
ggplot(housing.byyear, aes(x=Date)) +
geom_line(aes(y=Home.Value), color="red") +
geom_line(aes(y=Land.Value), color="blue")
Right way
library(tidyr)
home.land.byyear <- gather(housing.byyear, value = "value", key = "type", Home.Value, Land.Value)
ggplot(home.land.byyear, aes(x=Date, y=value, color=type)) +
geom_line()
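The tidyr documentation marks gather() as superseded by pivot_longer(). Assuming tidyr 1.0.0 or later is installed, the same wide-to-long reshaping could be sketched as follows; the `wide` data frame is a made-up stand-in for housing.byyear.

```r
library(tidyr)

# Hypothetical wide data: one column per series
wide <- data.frame(Date = c(2000, 2001, 2002),
                   Home.Value = c(100, 110, 120),
                   Land.Value = c(30, 35, 40))

# Stack the two value columns into a name column and a value column
long <- pivot_longer(wide,
                     cols = c(Home.Value, Land.Value),
                     names_to = "type",
                     values_to = "value")
```

The result has one row per Date and series, which is the shape that aes(color = type) expects.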
Graph source: http://www.economist.com/node/21541178
Building on the graphics you created in the previous exercises, add the finishing touches to make it as close as possible to the original Economist graph.
dat <- read.csv("dataSets/EconomistData.csv")
head(dat)
## X Country HDI.Rank HDI CPI Region
## 1 1 Afghanistan 172 0.398 1.5 Asia Pacific
## 2 2 Albania 70 0.739 3.1 East EU Cemt Asia
## 3 3 Algeria 96 0.698 2.9 MENA
## 4 4 Angola 148 0.486 2.0 SSA
## 5 5 Argentina 45 0.797 3.0 Americas
## 6 6 Armenia 86 0.716 2.6 East EU Cemt Asia
ggplot(dat, aes(x = CPI, y = HDI, size = HDI.Rank)) +
geom_point(aes(color=Region)) +
scale_x_continuous(name="Corruption Perceptions Index") +
scale_y_continuous(name="Human Development Index") +
scale_colour_manual(values = c("red", "blue", "green", "orange", "yellow", "black"))
library("ggrepel")
ggplot(dat, aes(x = CPI, y = HDI, size = HDI.Rank)) +
geom_point(shape = 1, aes(color=Region)) +
geom_text_repel(aes(label=Region), size = 3) +
scale_x_continuous(name="Corruption Perceptions Index") +
scale_y_continuous(name="Human Development Index") +
scale_colour_manual(values = c("red", "blue", "green", "orange", "violet", "black"))
dat <- read.csv("dataSets/EconomistData.csv")
pc1 <- ggplot(dat, aes(x = CPI, y = HDI, color = Region))
pc1 + geom_point()
To complete this graph we need to:
* add a trend line
* change the point shape to open circle
* change the order and labels of Region
* label select points
* fix up the tick marks and labels
* move color legend to the top
* title, label axes, remove legend title
* theme the graph with no vertical guides
* add model R2 (hard)
* add sources note (hard)
* final touches to make it perfect (use image editor for this)
Adding the trend line is not too difficult, though we need to guess at the model being displayed on the graph. A little bit of trial and error leads us to a linear model of y on log(x). Notice that we put the geom_smooth layer first so that it will be plotted underneath the points, as was done on the original graph.
(pc2 <- pc1 +
geom_smooth(aes(group = 1),
method = "lm",
formula = y ~ log(x),
se = FALSE,
color = "red")) +
geom_point()
This one is a little tricky. We know that we can change the shape with the shape argument, but what value do we set shape to? The example shown in ?shape can help us:
## A look at all 25 symbols
df2 <- data.frame(x = 1:5 , y = 1:25, z = 1:25)
s <- ggplot(df2, aes(x = x, y = y))
s + geom_point(aes(shape = z), size = 4) + scale_shape_identity()
## While all symbols have a foreground colour, symbols 21-25 also take a
## background colour (fill)
s + geom_point(aes(shape = z), size = 4, colour = "Red") +
scale_shape_identity()
s + geom_point(aes(shape = z), size = 4, colour = "Red", fill = "Black") +
scale_shape_identity()
This shows us that shape 1 is an open circle, so
pc2 +
geom_point(shape = 1, size = 4)
That is better, but unfortunately the size of the line around the points is much narrower than on the original. This is a frustrating aspect of ggplot2, and we will have to hack around it. One way to do that is to plot multiple point layers of slightly different sizes.
(pc3 <- pc2 +
geom_point(size = 4.5, shape = 1) +
geom_point(size = 4, shape = 1) +
geom_point(size = 3.5, shape = 1))
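A possible alternative to stacking layers, assuming a reasonably recent ggplot2 release in which geom_point() supports the stroke aesthetic, is to set the outline width directly. The sketch below uses a made-up data frame (`toy`), and the stroke value is an arbitrary choice that would need tuning against the original graph.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 8))

# stroke sets the outline width of the point symbol
p_stroke <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(shape = 1, size = 4, stroke = 1.25)
```

This keeps a single point layer, which also keeps the legend keys consistent with the plotted symbols.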
This one is tricky in a couple of ways. First, there is no attribute in the data that separates points that should be labelled from points that should not be. So the first step is to identify those points.
pointsToLabel <- c("Russia", "Venezuela", "Iraq", "Myanmar", "Sudan",
                   "Afghanistan", "Congo", "Greece", "Argentina", "Brazil",
                   "India", "Italy", "China", "South Africa", "Spain",
                   "Botswana", "Cape Verde", "Bhutan", "Rwanda", "France",
                   "United States", "Germany", "Britain", "Barbados",
                   "Norway", "Japan", "New Zealand", "Singapore")
Now we can label these points using geom_text, like this:
(pc4 <- pc3 +
geom_text(aes(label = Country),
color = "gray20",
data = subset(dat, Country %in% pointsToLabel)))
This more or less gets the information across, but the labels overlap in a most unpleasing fashion. We can use the ggrepel package to make things better, but if you want perfection you will probably have to do some hand-adjustment.
library("ggrepel")
pc3 +
geom_text_repel(aes(label = Country),
color = "gray20",
data = subset(dat, Country %in% pointsToLabel),
force = 10)
Things are starting to come together. There are just a couple more things we need to add, and then all that will be left are theming changes. Comparing our graph to the original, we notice that the labels and order of the Regions in the color legend differ. To correct this we need to change both the labels and order of the Region variable. We can do this with the factor function.
dat$Region <- factor(dat$Region,
levels = c("EU W. Europe",
"Americas",
"Asia Pacific",
"East EU Cemt Asia",
"MENA",
"SSA"),
labels = c("OECD",
"Americas",
"Asia &\nOceania",
"Central &\nEastern Europe",
"Middle East &\nnorth Africa",
"Sub-Saharan\nAfrica"))
Now when we construct the plot using these data the order should appear as it does in the original.
pc4$data <- dat
pc4
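How factor() handles levels and labels can be seen on a small vector. This is a base R sketch with made-up codes; "XX" is included on purpose to show what happens to a value that is missing from levels.

```r
# Hypothetical codes; "XX" is deliberately absent from levels
region <- c("SSA", "MENA", "Americas", "XX")

region_f <- factor(region,
                   levels = c("Americas", "MENA", "SSA"),
                   labels = c("Americas",
                              "Middle East &\nnorth Africa",
                              "Sub-Saharan\nAfrica"))

levels(region_f)        # order follows `levels`, text follows `labels`
as.character(region_f)  # "XX" becomes NA because it is not in `levels`
```

Because unmatched values silently become NA, it is worth checking for NAs after recoding a variable this way.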
The next step is to add the title and format the axes. We do that using the scales system in ggplot2.
library(grid)
(pc5 <- pc4 +
scale_x_continuous(name = "Corruption Perceptions Index, 2011 (10=least corrupt)",
limits = c(.9, 10.5),
breaks = 1:10) +
scale_y_continuous(name = "Human Development Index, 2011 (1=Best)",
limits = c(0.2, 1.0),
breaks = seq(0.2, 1.0, by = 0.1)) +
scale_color_manual(name = "",
values = c("#24576D",
"#099DD7",
"#28AADC",
"#248E84",
"#F2583F",
"#96503F")) +
ggtitle("Corruption and Human development"))
Our graph is almost there. To finish up, we need to adjust some of the theme elements, and label the axes and legends. This part usually involves some trial and error as you figure out where things need to be positioned. To see what these various theme settings do you can change them and observe the results.
library(grid) # for the 'unit' function
(pc6 <- pc5 +
theme_minimal() + # start with a minimal theme and add what we need
theme(text = element_text(color = "gray20"),
legend.position = c("top"), # position the legend in the upper left
legend.direction = "horizontal",
legend.justification = 0.1, # anchor point for legend.position.
legend.text = element_text(size = 11, color = "gray10"),
axis.text = element_text(face = "italic"),
axis.title.x = element_text(vjust = -1), # move title away from axis
axis.title.y = element_text(vjust = 2), # move title away from axis
axis.ticks.y = element_blank(), # element_blank() is how we remove elements
axis.line = element_line(color = "gray40", size = 0.5),
axis.line.y = element_blank(),
panel.grid.major = element_line(color = "gray50", size = 0.5),
panel.grid.major.x = element_blank()
))
The last bit of information that we want to have on the graph is the variance explained by the model represented by the trend line. Let's fit that model and pull out the R² first, then think about how to get it onto the graph.
(mR2 <- summary(lm(HDI ~ log(CPI), data = dat))$r.squared)
## [1] 0.5212859
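For reference, the value reported by summary() is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on simulated data (the variables and coefficients are made up and are not the Economist data).

```r
set.seed(42)
# Hypothetical data with a logarithmic trend plus noise
x <- seq(1, 10, length.out = 50)
y <- 0.3 + 0.25 * log(x) + rnorm(50, sd = 0.05)

fit <- lm(y ~ log(x))

ss_res <- sum(residuals(fit)^2)   # unexplained variation
ss_tot <- sum((y - mean(y))^2)    # total variation around the mean
r2_manual <- 1 - ss_res / ss_tot

# Compare the manual value with the one reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)
```

This identity holds for models fitted with an intercept, which is the case for the trend line used here.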
OK, now that we’ve calculated the values, let’s think about how to get them on the graph. ggplot2 has an annotate function, but this is not convenient for adding elements outside the plot area. The grid package has nice functions for doing this, so we’ll use those.
And here it is, our final version!
library(grid)
pc6
grid.text("Sources: Transparency International; UN Human Development Report",
x = .02, y = .03,
just = "left",
draw = TRUE)
grid.segments(x0 = 0.81, x1 = 0.825,
y0 = 0.90, y1 = 0.90,
gp = gpar(col = "red"),
draw = TRUE)
grid.text(paste0("R² = ",
as.integer(mR2*100),
"%"),
x = 0.835, y = 0.90,
gp = gpar(col = "gray20"),
draw = TRUE,
just = "left")